1  Abstract


This report provides an in-depth examination of the accomplishments of the recently
completed CNS project. Research contributions of this project span a wide range of fields:
digital microsystems design, computer architecture, multiprocessor system design,
neuromorphic analog processing, parallel programming languages, and neural network
applications. Through numerous conference presentations and peer-reviewed journal articles,
these advances have had broad impact in the university, industrial, and government communities.
The research project also served to train many graduate students in a variety of disciplines
in computer science, electrical engineering, and related scientific fields.


2  CNS:  A  Five  Year  Overview 


Our primary motivation for the CNS project sprang from a simple question: given the
neural network paradigms in common use in 1991, would neural networks with several
million parameters be useful for engineering applications and neuroscience research?

Experimenting  with  networks  of  this  size  on  real  problems  was  not  feasible  with  off-the- 
shelf  computing  hardware  available  at  the  start  of  the  project.  Neural  network  researchers, 
working  by  themselves  with  commercial  hardware,  could  not  experimentally  attack  the 
question.  However,  we  believed  that  a  team  of  computer  system  designers  and  neural 
network  experts  could  answer  this  question,  by: 


•  Defining a flexible, general-purpose computing architecture for neural computing.
This architecture should include provisions for real-time input from real-world sensors.

•  Designing  a  hardware  implementation  of  this  architecture  that  would  support  the 
simulation  of  neural  systems  with  millions  of  weights,  at  a  peak  computation  rate  of 
1  billion  connection  updates  per  second. 

•  Defining software tools and programming techniques that enable application-level
programmers to realize the full computational power of this hardware.

•  Providing a complete computing solution to the CNS applications community based
on these new technologies, one that functions as a reliable platform for research.

•  Constructing dozens of copies of this prototype hardware for use by a community of
neural network researchers, and guiding this community through experiments with
multi-million-parameter systems.

Judged by this list of goals, the CNS project was a success. During its five-year lifespan,
we:

•  Created the CNS computing architecture, featuring a specialized microprocessor of
our own design, T0, and specialized analog pre-processing chips for real-time auditory
input.

•  Designed and fabricated the T0 microprocessor, a 750,000 transistor CMOS chip that
ran at 40 MHz, and manufactured hundreds of copies of this fully functional part.

•  Constructed over 50 complete SPERT computing systems, and distributed these
systems to active researchers in the neural-network community.

•  Created a complete software development and operating system environment, to
support neural-network application development on T0 and SPERT.

•  Developed  and  distributed  worldwide  the  object-oriented  programming  language  Sather 
and  pioneered  several  basic  ideas  in  parallel  software  for  irregular  problems  in  the 
pSather  version. 

•  Created complete neural-network applications that used SPERT systems. These
applications used neural networks with millions of parameters to solve significant
engineering problems.

•  Finally, combined multiple copies of the SPERT system to produce a multiprocessing
neural network acceleration system, TetraSPERT. TetraSPERT is capable
of simulating a neural network with 5 million weights, while achieving a computational
throughput of over a billion connection updates per second. We ported complete
applications to TetraSPERT, closing the loop between supercomputer system design and
neural network research.

The  preceding  description  highlights  the  project  engineering  success  of  the  CNS  project: 
chip  designs  completed,  numbers  of  machines  fabricated,  etc.  However,  as  an  academic 
research  group,  our  primary  mission  is  not  project  engineering.  Exploring  new  approaches 
to  problems,  creating  new  engineering  techniques,  and  formulating  scientific  knowledge  are 
the  true  missions  of  an  academic  research  project.  The  research  contributions  of  the  CNS 
project  span  several  disciplines  in  computer  science  and  electrical  engineering,  as  detailed 
below. 


2.1  Computer  Architecture 

At the beginning of the CNS project, the literature on digital systems for neural networks
focused on embedding specific neural training and evaluation algorithms in architectures
optimized for maximum throughput. To begin our project, we studied the suitability of
this specialized acceleration for an important application for the CNS research community:
state-of-the-art speech recognition using neural networks.

We found that speech recognition applications spend a significant fraction of execution
time performing algorithms unrelated to neural networks - tasks such as speech
pre-processing and hidden Markov model state decoding. These tasks would not be able to
use specialized neural network hardware - and thus the total speedup of the speech
recognition applications would be limited, regardless of how fast the neural-network hardware
computed.
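
The limit described above has the standard form of Amdahl's Law. The following is a minimal sketch of that arithmetic, not a measurement from the CNS project: the 60 percent neural-network fraction and the accelerator speedups are assumed, hypothetical numbers chosen only for illustration.

```python
def overall_speedup(nn_fraction, nn_speedup):
    """Amdahl's Law: speedup of the whole application when only the
    neural-network fraction of its run time is accelerated."""
    if not 0.0 <= nn_fraction <= 1.0:
        raise ValueError("nn_fraction must lie in [0, 1]")
    if nn_speedup <= 0.0:
        raise ValueError("nn_speedup must be positive")
    return 1.0 / ((1.0 - nn_fraction) + nn_fraction / nn_speedup)


def speedup_ceiling(nn_fraction):
    """Limit of overall_speedup as the accelerator becomes infinitely fast."""
    if nn_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - nn_fraction)


if __name__ == "__main__":
    NN_FRACTION = 0.6  # assumed share of run time spent in neural-network code
    for s in (10.0, 100.0, 1000.0):
        print(f"accelerator x{s:g}: overall x{overall_speedup(NN_FRACTION, s):.2f}")
    print(f"ceiling: x{speedup_ceiling(NN_FRACTION):.2f}")
```

Under these assumed numbers, no neural-network accelerator, however fast, can make the whole application more than 2.5 times faster, because the remaining 40 percent of the run time is untouched.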

This observation - which we dubbed “Amdahl’s Law for Neurocomputing” - drove the
CNS project in a new direction, towards hardware architectures that accelerated neural
computations in a general-purpose framework. We found that a classical machine
architecture for scientific computation - a vector unit combined with a fast scalar processor -
could be reworked to serve the needs of the CNS project. Compared to other approaches
to acceleration, a vector processor is an easy platform for both compilers and application
programmers to target. Our reworking of the vector machine concept for the CNS project,
embodied in our T0 microprocessor architecture, features these novel enhancements:


•  In  addition  to  the  standard  arithmetic  ALU  operations,  our  vector  ALUs  include 
special  opcodes  optimized  for  fast  neural  network  acceleration,  including  instructions 
which  support  the  evaluation  of  many  weight  multiplications  in  parallel. 

•  Instead  of  the  large  floating-point  formats  of  classical  vector  machines,  our  vector 
registers  and  ALUs  support  a  lean  fixed-point  data  format,  with  sufficient  accuracy 
and  precision  for  both  neural-network  and  general  signal  processing  tasks. 

•  Special  data  movement  and  combination  instructions  between  vector  registers  and  the 
scalar  unit  allow  many  algorithms  to  sustain  the  maximum  bandwidth  throughout  a 
task. 

•  Support  for  efficient  vector  load  and  store  operations  over  a  high-bandwidth  interface 
to  off-chip  memory. 
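
To illustrate the fixed-point enhancement listed above, here is a minimal sketch of a fixed-point multiply-accumulate over a vector of weights. The 16-bit word, the 8 fractional bits, and all function names are assumptions made for this example; the text does not specify the actual T0 data formats or instructions.

```python
FRAC_BITS = 8                                    # assumed fractional bits (hypothetical)
WORD_MIN, WORD_MAX = -(1 << 15), (1 << 15) - 1   # assumed 16-bit signed word


def saturate(v):
    """Clamp an integer to the assumed word range instead of wrapping."""
    return max(WORD_MIN, min(WORD_MAX, v))


def to_fixed(x):
    """Quantize a real number to the assumed fixed-point format."""
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(q):
    """Convert a fixed-point word back to a real number."""
    return q / (1 << FRAC_BITS)


def vector_mac(weights, inputs):
    """Fixed-point dot product: multiply element pairs, accumulate at full
    precision, then scale back once (round to nearest) and saturate."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += w * x                  # each product has 2 * FRAC_BITS fractional bits
    rounded = (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
    return saturate(rounded)


if __name__ == "__main__":
    w = [to_fixed(v) for v in (0.5, -0.25, 1.0)]
    x = [to_fixed(v) for v in (2.0, 4.0, -1.5)]
    print(to_float(vector_mac(w, x)))
```

Integer words of this kind need far less silicon per arithmetic unit than floating-point formats, which is the trade the bullet above describes; the rounding and saturation policy shown here is one common choice, not necessarily the one used in the chip.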


These architectural concepts have found academic and industrial acceptance far beyond
the neural-network acceleration community. The multimedia extensions to the Intel
Pentium architecture, MMX, are a clear application of the “Amdahl’s Law for
Neurocomputing” principle of general-purpose architecture for special-purpose acceleration,
targeted to graphics and multimedia applications. In addition, the vector architecture of T0
has led to an exploration of vector architectures for mainstream computing. For example, the
Intelligent DRAM (IRAM) project, led by Professor Dave Patterson of UC Berkeley, adopts
a vector architecture for embedded DRAM computing that shares many features with T0.

2.2  Digital  VLSI  Design

The T0 microprocessor was a major VLSI design effort: a 750,000 transistor chip in a 1.2 µm
CMOS process that achieved fully functional first silicon. Research advances were made in
circuit design and microarchitecture as part of this VLSI effort.

As part of the T0 project, we developed a novel technique for high-speed chip-to-chip
communication. We designed a serial link interface that transfers bits between chips at a
rate of 550 Mbps. This interface implements a low voltage signalling strategy, using an
on-chip voltage reference. Two delay-locked loops are used to maintain data synchronization.
We fabricated a test chip using this interface technology, and measurements on the chip
confirmed reliable communications performance at full speed (275 MHz clock, two bits per
cycle, to yield 550 Mbps).

2.3  Software  Design 

In addition to VLSI chip design and board-level hardware design, software design was a
key part of the CNS project. A full operating system was designed for SPERT, and
included multiprocessing features to support TetraSPERT. A custom neural-network training
package, QuickNet, was the key middleware application of the CNS project. This training
system ran on SPERT, TetraSPERT, and conventional workstations, and was the interface
through which speech-recognition applications accessed CNS machines.

Another software package developed during the CNS project was the PHiPAC (Portable
High Performance ANSI C) system. PHiPAC generates optimized matrix-matrix multiply
code for a specific machine architecture. As neural-network software makes
liberal use of matrix-matrix operations, PHiPAC can be used to optimize network training
and evaluation speed on a particular machine architecture. We have facilitated technology
transfer by offering PHiPAC distribution via the Internet; the package has attracted a
substantial user base in the numerical software community.
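
The general idea of tuning matrix multiply to a machine can be sketched as follows. This is not PHiPAC itself, which the text describes as a generator of ANSI C code; it is a small stand-in, with hypothetical function names, that shows a blocked (tiled) multiply and a timing-based choice among assumed candidate block sizes.

```python
import time


def naive_matmul(a, b):
    """Reference triple-loop product of two list-of-lists matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(a, b, block):
    """Compute a @ b one (block x block) tile at a time, so that each tile's
    operands can stay in fast memory while they are reused."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    row_c, row_a = c[i], a[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


def pick_block_size(a, b, candidates=(2, 4, 8, 16)):
    """Time each candidate block size on this machine and keep the fastest."""
    best, best_time = None, float("inf")
    for block in candidates:
        start = time.perf_counter()
        blocked_matmul(a, b, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = block, elapsed
    return best
```

In interpreted Python the timings say little about caches or registers, so the structure rather than the speed is the point: the tuning decision is made by measurement on the target machine instead of being fixed in advance.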

2.4  Neural  Network  Applications 

Continuous  speech  recognition  was  the  driving  application  of  the  CNS  project.  The  speech 
recognition  algorithms  used  in  the  CNS  project  are  a  blend  of  neural  networks  (multi-layer 
perceptrons  (MLPs)  trained  with  the  backpropagation  algorithm  for  phonetic  classification) 
and  traditional  techniques  (hidden  Markov  models  (HMMs)  for  word  recognition). 
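
A minimal sketch of how the two halves of such a hybrid can fit together is given below. It assumes the common formulation in which the MLP's per-frame class posteriors are divided by class priors to give scaled likelihoods, which an HMM then decodes with the Viterbi algorithm; the text does not spell out these details, and all names and the toy numbers are hypothetical.

```python
import math


def viterbi(posteriors, priors, log_trans, log_init):
    """Most likely HMM state path given per-frame MLP class posteriors.

    posteriors[t][s]: MLP output P(state s | frame t), assumed > 0
    priors[s]:        P(state s), e.g. relative frequency in training data
    log_trans[r][s]:  log transition probability from state r to state s
    log_init[s]:      log initial-state probability
    """
    n_states = len(priors)

    def log_obs(t, s):
        # Scaled likelihood: P(s | x) / P(s) is proportional to p(x | s).
        return math.log(posteriors[t][s]) - math.log(priors[s])

    score = [log_init[s] + log_obs(0, s) for s in range(n_states)]
    backptr = []
    for t in range(1, len(posteriors)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_r = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[best_r] + log_trans[best_r][s] + log_obs(t, s))
            ptr.append(best_r)
        score = new_score
        backptr.append(ptr)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    post = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
    trans = [[math.log(0.7), math.log(0.3)], [math.log(0.3), math.log(0.7)]]
    init = [math.log(0.5), math.log(0.5)]
    print(viterbi(post, [0.5, 0.5], trans, init))
```

In a real recognizer the states would be phonetic units chained into word models; here two abstract states are enough to show the data flow from classifier output to decoded state sequence.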

The theoretical underpinnings of this hybrid HMM-MLP architecture were well understood
at the start of the CNS project, and demonstration systems were developed for speech
recognition tasks of moderate size. During the course of the CNS project, hybrid HMM-MLP
systems were developed for very large, state-of-the-art speech recognition tasks. These
systems performed as well as traditional speech recognition algorithms on these benchmark
tasks, while using significantly fewer parameters. Motivated by the advantages of smaller
parameter sets (faster recognition time, smaller RAM requirements), several commercial
speech recognition vendors have adopted our HMM-MLP architecture for their product
lines.

The hardware developments in the CNS project were central to the successful development
of large-vocabulary hybrid HMM-MLP recognizers. A key bottleneck in developing
these recognizers is the training of large (1-10 million parameters) neural networks to
evaluate candidate algorithmic ideas. During the later years of the CNS project, each speech
recognition researcher had a dedicated SPERT system, using the T0 microprocessor,
installed in his or her workstation. In addition to these dedicated systems, the TetraSPERT
system, with 4 SPERT cards, was a shared computational resource for the group.

This hardware enabled the fast training of these multi-million parameter MLPs,
dramatically increasing research productivity. Algorithmic advances by CNS researchers in
speech recognition include:

•  New  algorithms  to  lessen  the  effect  of  variable  speaking  rates  on  recognition  accuracy. 

•  New  algorithms  to  lessen  the  effect  of  foreign  accents  on  recognition  accuracy. 

•  Several generations of speech pre-processing systems that improve recognition
accuracy in the presence of background noise.

•  Several generations of adaptive techniques that lessen the effect of a microphone’s
frequency response on recognition accuracy.

•  Advanced  formulations  of  the  statistical  foundations  of  hybrid  MLP-HMM  systems, 
which  better  capture  the  underlying  structure  of  the  speech  code. 

•  Advancements in the parallelization of hidden Markov model state decoding, to
improve the recognition time of very-large-vocabulary recognition systems.

•  New  methods  of  optimally  combining  the  results  of  several  different  speech  recognition 
algorithms. 

•  New ways of applying psycholinguistic knowledge to speech recognition.

In addition to speech recognition, the CNS project also supported application developers
in vision and image compression. UCB-based researchers in these fields were trained
to target their applications to SPERT, and benchmarks were performed that showed the
advantages of vector microprocessing on low-level vision applications.

2.5  Neuromorphic  Analog  VLSI  Design 

At the beginning of the CNS project, basic circuit techniques for analog VLSI
implementations of neuromorphic systems were well established. Several research laboratories had
developed prototype chips that showed the promise of biologically-inspired approaches to
vision and auditory sensory processing. Vision chips that combined photoreceptors and
spatiotemporal post-processing had progressed through several generations of designs. Many
auditory chips were also developed, including several generations of cochlear designs, as well
as post-cochlear processing for pitch, spectral shape, and spatial localization.

These prototype chips, however, were designed primarily to prove out circuit and
architectural techniques. A main challenge of the CNS project was to leverage this knowledge
base, and create sensory chips that were useful as input devices for the CNS computing
system. We focused on auditory sensory chips.


2.5.1  Special  Purpose  A/D  Converters  for  Audio 

The traditional way to add real-world input to a digital system is to use a general-purpose
analog-to-digital (A/D) converter. To incorporate analog processing in this model, we
needed to develop a replacement for general-purpose A/D converters. We call these
replacement parts special-purpose analog-to-digital converters for auditory processing.

Analogous to general-purpose A/D converters, special-purpose A/D converters take as input an
analog audio signal, and produce a digital output suitable for computer input. However, the
digital output is not a digitized copy of the analog waveform. Rather, the audio input is first
processed by analog circuits that perform operations unique to the application. The final
representation is then digitized, using a representation that codes the signal in an efficient
way. In addition to producing a digital output, the converters also receive digital input, to
customize the analog processing for a specific application.

For  the  CNS  project,  we  designed  and  fabricated  special-purpose  converters  for  speech 
and  audio  processing.  These  converters  feature  neuromorphic  algorithms  cast  in  analog 
circuits,  including  a  silicon  cochlea  circuit,  temporal  adaptation  circuits,  and  temporal 
autocorrelation  circuits.  The  final  output  is  coded  using  the  event-address  protocol,  a 
technique  optimized  for  the  efficient  digital  transmission  of  neuromorphic  representations. 
The  output  bus  of  the  converters  is  cascadable,  supporting  systems  that  include  several 
converters,  each  computing  a  different  representation  on  the  audio  input  in  parallel.  The 
converters  include  non-volatile  analog  storage  to  hold  control  parameters;  this  storage  is 
programmable  under  digital  input. 
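
The idea of coding events by address, and of sharing one bus among several converters, can be sketched as follows. The word layout (a chip identifier packed above a 7-bit channel number), the function names, and the modelling of the cascaded bus as a time-ordered merge are all assumptions made for illustration; the text does not give the actual bus format.

```python
import heapq
from typing import Iterable, List, Tuple

CHANNEL_BITS = 7            # assumed: up to 128 channels per chip (hypothetical)

Event = Tuple[float, int]   # (time in seconds, address word)


def encode_chip(chip_id: int, spikes: Iterable[Tuple[float, int]]) -> List[Event]:
    """Turn (time, channel) events from one chip into time-ordered address
    words. Each word packs the chip id above the channel number."""
    events = []
    for t, channel in spikes:
        if not 0 <= channel < (1 << CHANNEL_BITS):
            raise ValueError("channel out of range")
        events.append((t, (chip_id << CHANNEL_BITS) | channel))
    return sorted(events)


def merge_bus(*streams: List[Event]) -> List[Event]:
    """Model a shared (cascaded) bus: interleave sorted streams by time."""
    return list(heapq.merge(*streams))


def decode(word: int) -> Tuple[int, int]:
    """Recover (chip_id, channel) from an address word."""
    return word >> CHANNEL_BITS, word & ((1 << CHANNEL_BITS) - 1)


if __name__ == "__main__":
    cochlea = encode_chip(0, [(0.002, 5), (0.001, 3)])
    correlator = encode_chip(1, [(0.0015, 5)])
    for t, word in merge_bus(cochlea, correlator):
        print(t, decode(word))
```

Because only the identity of the unit that fired is transmitted, and only when it fires, a sparse neuromorphic representation costs little bus bandwidth; the receiver recovers timing from when each word arrives.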


2.5.2  Speech  Recognition  Experiments  Using  Silicon  Auditory  Models 

We built a multi-representation speech processing system, using three copies of our
special-purpose converter chip. We interfaced this system to a workstation, and designed a real-time
software system for data display and capture. This system let us speak into a microphone,
and see real-time output of three different neural representations on the workstation screen.
The workstation could also generate analog audio signals: by using the workstation to play
digitized speech recordings into the converter system, we could do large-scale automated
experiments using our analog hardware.

Using this system, we designed a speech recognition system for a 13-word,
speaker-independent, telephone-quality recognition task. We used our analog hardware for speech
pre-processing, and used the MLP-HMM speech recognition software described earlier in
this report to complete the system.

This hybrid speech recognition system performs well on this 13-word task, yielding a 4
percent error rate. Speech recognition systems with performance in this range are usable
in commercial applications, if good error recovery strategies are employed. However, it is
disappointing to note that the best non-neuromorphic, software-based speech pre-processing
systems do significantly better on this task, yielding a 2 percent error rate. In addition, the
neuromorphic front end does not do well when processing input speech corrupted by background
noise, yielding very high error rates.

We believe these difficulties reflect the shortcomings of joining neuromorphic sensory
representations with traditional “back-ends” for speech recognition. These back-ends have been
optimized over several decades, by hundreds of researchers, to work well with traditional
signal-processing front-ends. It is not surprising that such back-ends work relatively poorly
if interfaced with a radically different front-end approach. To realize the promise of the
neuromorphic approach in speech recognition, a reformulation of the complete recognition
process may be in order.


2.5.3  Micropower  Speech  Recognition 

The computing architectures used in analog neuromorphic auditory processing have
extraordinary power efficiency. For example, a 51-channel silicon cochlea consumes a few
microwatts of power, a specification that supports years of operation on a small wristwatch
battery. Could complete speech recognizers be built using this circuit technology, and
achieve micropower dissipation? If so, the micropower advantage would be a sufficient
reason to develop speech recognition systems using neuromorphic front ends, even if the
recognition performance is no better than current software techniques.

We explored the architecture of a micropower speech recognition system, using the
hybrid speech recognizer described in the previous section as a model. This system includes two
major algorithms which would require analog micropower implementation: neural networks
and hidden Markov model state decoding. Several researchers have focused on micropower
neural network technology, with encouraging results. In the CNS project, we focused on
implementing hidden Markov model state decoding using micropower analog circuits. This
research, done in collaboration with Richard Lippmann at MIT Lincoln Laboratory,
resulted in the design and fabrication of a functional prototype for Baum-Welch state decoding
with microwatt power consumption.

2.6  Parallel  Languages  and  Programming

The parallel programming aspect of the project also made a great deal of progress, but was
not as closely integrated as we had originally planned. In the early years, we hoped that the
CNS could turn into a general-purpose machine and designed the parallel Sather language
to be compatible with the CNS design. This combined effort had a positive effect on both
sub-projects, but the integration was not carried through because it was decided to make
CNS more specialized and to retain Sather as a general-purpose, multi-platform system.

The Sather project on object-oriented programming and its parallel version, pSather,
have both been quite successful. Many of the innovations pushed in the Sather project
are now mainstream, including automatic storage management, separation of inheritance
and subtyping, and robust abstraction libraries. The widely used Java language incorporates
these and other Sather features, and some of our students and post-docs are playing a central
role in future Java development. The Sather project trained four doctoral students and six
post-docs in addition to having many shorter-term visitors and collaborators. This effort
has led to basic advances in our understanding of storage management involving caches,
active thread management, and the mapping of large irregular tasks to parallel computers.


3  Senior  Personnel 

One of the key senior staff members providing essential research contributions was
Post-Doctoral Researcher John Lazzaro. John was responsible for setting directions for the
research and was solely responsible for the computational aspects of the research.


4  Students 

Students  supported  by  this  grant  who  have  graduated  with  MS  degrees  were:  William 
Chang  (December  1997),  Todd  Hodes  (December  1997),  Nathan  McNamara  (May  1995), 
and  David  Stoutamire  (December  1997). 

Students  who  have  graduated  with  their  Ph.D.  degrees  were:  Krste  Asanovic  (December 
1997),  David  Bailey  (April  1997),  Yochai  Konig  (May  1996),  Srinivas  Narayanan  (May 
1997),  and  Thorsten  von  Eicken  (December  1993). 